“From Multivariate to Longitudinal Data”
April 11, 2017
> dim(deng)
[1] 17585 255
> sum(deng == 0) / prod(dim(deng))
[1] 0.5019552
> head(sample(colnames(deng)))
[1] "earlyblast" "16cell" "4cell" "midblast" "lateblast" "16cell"
> head(sample(rownames(deng)))
[1] "Gm7073" "Mir697" "Uqcrc2" "Ap2m1" "Slc10a3" "Ccr1"
\( \forall \) gene \( g \), fit an OLS regression with known timepoint \( t \) per cell–
\[ \log_2(g +1) \sim \beta_0 + \beta_1 t \]
\( \forall \) gene \( g \), fit regression with permuted timepoint \( t^* \) per cell
\[ \log_2(g +1) \sim \beta_0 + \beta_1 t^* \]
\( \forall \) gene \( g \), fit regression with permuted factor timepoint \( t^{**} \)
\[ \log_2(g +1) \sim \beta_0 + \beta_1 t^{**} \]
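The per-gene fits against true and permuted timepoints can be sketched in base R. This is a toy stand-in, not the Deng data; `expr` and `t` are hypothetical names:

```r
set.seed(1)

# Toy stand-in for the expression matrix: 100 genes x 30 cells,
# with a known timepoint label per cell
expr <- matrix(rpois(100 * 30, lambda = 5), nrow = 100)
t    <- rep(1:6, each = 5)

# Per-gene OLS slope of log2(g + 1) on the true timepoints
slopes_obs <- apply(expr, 1, function(g) coef(lm(log2(g + 1) ~ t))[2])

# Same fit against permuted timepoints t* -- permutation breaks any real
# association, so these slopes form a null distribution for comparison
t_star <- sample(t)
slopes_null <- apply(expr, 1, function(g) coef(lm(log2(g + 1) ~ t_star))[2])

summary(abs(slopes_obs))
summary(abs(slopes_null))
```

With real data, genes whose observed slope is extreme relative to the permutation null are the ones that track developmental time.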
Given a matrix \( \mathbf{Y} \) of \( D \) genes (features) by \( n \) samples, determine a latent vector \( \mathbf{P} \) with dimension \( 1 \) x \( n \) that reflects the developmental trajectory of the \( n \) cells from the variance in \( D \) genes.
> cor(runif(length(time)) %>% sort(), as.numeric(time) %>% sort())^2
[1] 0.8799669
\( \mathbf{Y} \) is a \( D \) x \( n \) matrix. Compute the covariance matrix–
\[ \mathbf{\Sigma} = E(\mathbf{YY^{T}}) - \mathbf{\mu \mu^T} \]
where \( \mathbf{\mu} = E(\mathbf{Y}) \)
Then compute the spectral decomposition of \( \mathbf{\Sigma} \)
\[ \mathbf{\Sigma} a_j = \lambda_j a_j \]
for \( j \in (1, ..., D) \)
Then the \( a_j \) are the eigenvectors of the covariance matrix \( \mathbf{\Sigma} \).
Note the ordering of \( j \) is meaningful–
\[ \lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D \geq 0 \]
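The covariance construction and spectral decomposition above can be checked directly in base R (toy matrix; `Y` is a hypothetical stand-in for the data):

```r
set.seed(1)
D <- 5; n <- 200
Y <- matrix(rnorm(D * n), nrow = D)      # D genes x n samples

# Sigma = E(YY^T) - mu mu^T, estimated from the samples
mu    <- rowMeans(Y)
Sigma <- Y %*% t(Y) / n - mu %*% t(mu)

# Spectral decomposition: Sigma a_j = lambda_j a_j
eig <- eigen(Sigma, symmetric = TRUE)
eig$values                                # sorted lambda_1 >= ... >= lambda_D

# Check one eigenpair satisfies Sigma a_1 = lambda_1 a_1
all.equal(as.vector(Sigma %*% eig$vectors[, 1]),
          eig$values[1] * eig$vectors[, 1])
```

`eigen()` already returns the eigenvalues in the non-increasing order the slide requires.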
> irlba::prcomp_irlba()  # truncated PCA via implicitly restarted Lanczos; scales to large matrices
> prcomp()               # SVD-based PCA (numerically preferred)
> princomp()             # eigendecomposition of the covariance matrix
> cor(pca$rotation[,1], as.numeric(time))^2
[1] 0.5933009
PCA by itself isn't satisfactory…
Ideas for improvements?
1) Quantifying uncertainty
2) Non-linear latent variable inference
Next several images taken from slides by Guy Wolf (Yale)
These slides can be found here
\( \mathbf{Y} \) is a \( D \) x \( n \) matrix. \( \mathbf{x} \) is a \( d \) x \( n \) matrix (\( d < D \)). Want to relate \( \mathbf{x} \) and \( \mathbf{Y} \) and assume that the relationship is linear–
\[ \mathbf{Y = Wx} + \epsilon \]
where \( W \) is a \( D \) x \( d \) matrix.
By convention (after centering the data),
\[ \mathbf{x} \sim \mathcal{N} (0, \mathbf{I}) \]
where \( \mathbf{I} \) is the identity matrix of dimension \( d \) x \( d \).
Specify the error, \[ \epsilon \sim \mathcal{N} (0, \mathbf{\Psi}) \]
where we assume \( \mathbf{\Psi} \) to be a diagonal matrix \( D \) x \( D \).
Then \( Y_i \) for \( i \in (1, ..., n) \),
\[ Y_i |x_i \sim \mathcal{N} (\mathbf{W}x_i, \mathbf{\Psi}) \]
Then we can integrate out the latent variables–
\[ p( Y_i |\mathbf{W}, \mathbf{\Psi})= \int p( Y_i | x_i , \mathbf{W, \Psi}) p(x_i) dx_i \]
Under this specification, we can write the MVN distribution for our observations–
\[ \mathbf{Y} \sim \mathcal{N} (0, \mathbf{C}), \mathbf{C} = \mathbf{WW^T }+ \mathbf{\Psi} \]
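The marginalisation can be sanity-checked by simulation under the stated assumptions (small \( D \), \( d \); all names hypothetical):

```r
set.seed(1)
D <- 4; d <- 2; sigma2 <- 0.5; n <- 50000

W   <- matrix(rnorm(D * d), nrow = D)

# Draw x ~ N(0, I_d) and eps ~ N(0, Psi) with Psi = sigma2 * I_D,
# then form Y = Wx + eps
x   <- matrix(rnorm(d * n), nrow = d)
eps <- matrix(rnorm(D * n, sd = sqrt(sigma2)), nrow = D)
Y   <- W %*% x + eps

# The empirical covariance of Y should approach C = WW^T + Psi
C_hat <- Y %*% t(Y) / n
C     <- W %*% t(W) + diag(sigma2, D)
max(abs(C_hat - C))    # small for large n
```

This mirrors the derivation: integrating out \( x_i \) leaves a zero-mean Gaussian whose covariance is \( \mathbf{WW^T} + \mathbf{\Psi} \).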
Young-Whittle Factor Analysis (1950s)
Let \( \psi_i \) be the \( i \)th diagonal element of \( \mathbf{\Psi} \); add the isotropic constraint \( \psi_i = \sigma^2 \)
Assume \( \sigma^2 \) known, MLE yields same \( W \) as PCA
\[ \mathcal{L} = \tfrac{-N}{2}\{ D\log(2\pi) + \log|\mathbf{C}| + \textrm{tr}(\mathbf{C}^{-1}\mathbf{\Sigma}) \} \]
\[ \mathbf{C} = \mathbf{WW^T }+ \mathbf{\Psi} \]
\[ \mathbf{x} = \mathbf{W^{T}C^{-1}Y} \]
For \( \mathbf{W_{\textrm{MLE}}} \),
\[ \sigma_{MLE}^2 = \dfrac{1}{D-d} \sum_{j = d +1}^D \lambda_j \]
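A quick sketch of the closed form: \( \sigma^2_{MLE} \) is just the average of the \( D - d \) discarded eigenvalues of the sample covariance (toy data; names hypothetical):

```r
set.seed(1)
D <- 10; d <- 3; n <- 500
Y <- matrix(rnorm(D * n), nrow = D)      # toy centred data

Sigma  <- Y %*% t(Y) / n                 # sample covariance
lambda <- eigen(Sigma, symmetric = TRUE)$values

# sigma^2_MLE = (1 / (D - d)) * sum of the smallest D - d eigenvalues,
# i.e. the variance "lost" by keeping only d latent dimensions
sigma2_mle <- sum(lambda[(d + 1):D]) / (D - d)
sigma2_mle
```

Intuitively, the noise variance absorbs everything the retained \( d \) principal directions leave unexplained.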
library(pcaMethods)  # Bioconductor package
pca()                # dispatches on the method argument
resPPCA <- pca(data, method="ppca", center=FALSE, nPcs=5)  # probabilistic PCA
resBPCA <- pca(data, method="bpca", center=FALSE, nPcs=5)  # Bayesian PCA
\[ \mathbf{Y} \sim \mathcal{N} (0, \mathbf{C}), \mathbf{C} = \mathbf{WW^T }+ \mathbf{\Psi} \]
\[ \mathbf{C}(\mathbf{W_i},\mathbf{W_j}) = \mathbf{W_i^T}\mathbf{W_j} + \sigma^2\delta_{ij} \]
\[ \mathbf{C}(\mathbf{W_i},\mathbf{W_j}) = \theta_{rbf}\exp\Big(\dfrac{-\gamma}{2}(\mathbf{W_i}-\mathbf{W_j})^{T}(\mathbf{W_i}-\mathbf{W_j}) \Big) + ... \]
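The RBF covariance above can be sketched as a plain function, with \( \theta_{rbf} \) and \( \gamma \) as kernel hyperparameters. This is an illustrative sketch, not the pseudogp internals; all names are hypothetical:

```r
# Squared-exponential (RBF) kernel between two input vectors:
# theta_rbf * exp(-gamma/2 * ||xi - xj||^2)
rbf_kernel <- function(xi, xj, theta_rbf = 1, gamma = 1) {
  theta_rbf * exp(-gamma / 2 * sum((xi - xj)^2))
}

# Full covariance matrix over a set of 1-d latent inputs
x <- seq(0, 1, length.out = 5)
K <- outer(x, x, Vectorize(function(a, b) rbf_kernel(a, b)))
K[1, 1]   # theta_rbf on the diagonal (distance zero)  -> 1
```

Nearby latent points get high covariance, so the GP output varies smoothly along the latent trajectory; this is what replaces the linear \( \mathbf{W_i^T}\mathbf{W_j} \) term.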
library(pseudogp)
fit <- fitPseudotime(data, smoothing_alpha = 30, smoothing_beta = 6,
iter = 1000, chains = 1)
posteriorBoxplot(fit)
> cor(gplvm_means, as.numeric(time))^2
[1] 0.783522
> cor(pca$rotation[,1], as.numeric(time))^2
[1] 0.5933009
> cor(gplvm_means, as.numeric(time))^2
[1] 0.783522